ALICE: An Algorithm to Extract Abbreviations from MEDLINE

Author

  • HIROKO AO
Abstract

Methods: ALICE extracts an abbreviation and its expansion from the literature by using heuristic pattern-matching rules. The system consists of three phases and can potentially identify 320 valid abbreviation-expansion patterns as combinations of the rules. Results: It achieved 95% recall and 97% precision on randomly selected titles and abstracts from the MEDLINE database. Conclusion: ALICE extracted abbreviations and their expansions from the literature efficiently. The carefully compiled heuristics enabled it to extract abbreviations with high recall without significantly reducing precision. ALICE not only facilitates recognition of an undefined abbreviation in a paper by constructing an abbreviation database or dictionary, but also makes biomedical literature retrieval more accurate. This system is freely available at http://uvdb3.hgc.jp/ALICE/ALICE_index.html.

J Am Med Inform Assoc. 2005;12:576–586. DOI 10.1197/jamia.M1757.

It is essential for biomedical researchers to obtain knowledge from the MEDLINE database. However, numerous abbreviations such as gene and protein names, which are routinely used throughout the biomedical literature, hinder its efficient use. Abbreviations in the biomedical literature are highly ambiguous: one abbreviation may represent multiple expansions. For example, Liu et al. point out that 81.2% of abbreviations are ambiguous and have an average of 16.6 meanings. One typical example is the abbreviation PC. It may stand for personal computer, primary care, principal component, prostate cancer, etc. To make matters worse, the increasing number of biomedical papers in the MEDLINE database continues to introduce new abbreviations. A support system is urgently needed to help researchers recognize the expansions of abbreviations.
Here, we define abbreviations as “contractions of words or phrases that are used in place of their full versions” (we call these full versions expansions) and acronyms as “a type of abbreviation made up of the initial letters or syllables of other words.” Much effort has been expended to develop methods for extracting abbreviations and their expansions. For example, some algorithms use parentheses “( )” to limit the search, while others use both parentheses and cue words such as “or” or “stands for.” Almost all algorithms use heuristic patterns to identify abbreviations and acronyms. To extract expansions, some algorithms use manually constructed heuristic pattern-matching rules, while others use automatically constructed statistical rules. Some heuristic algorithms use shallow parsing. Although some of these algorithms show good results, they have various limitations. For example, when identifying an abbreviation, some algorithms assume that an abbreviation consists of only one word or that it must be enclosed in parentheses. Suppose that the abbreviation is “AMI,” which stands for “acute myocardial infarction.” These algorithms can extract its expansion if the original expression is “acute myocardial infarction (AMI).” However, they cannot extract the expansion if the original expression is “AMI (acute myocardial infarction).” Moreover, when extracting an expansion, some algorithms assume that the initial letter of an expansion must be the same as that of its abbreviation.
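The two parenthetical orientations and the first-letter assumption discussed above can be illustrated with a minimal Python sketch. The function names and the crude abbreviation test (one token, 2 to 10 characters, at least one uppercase letter) are assumptions made for illustration only; they are not ALICE's rules or those of any cited system.

```python
import re

# Hypothetical helpers for illustration; not taken from ALICE or any cited system.

def looks_like_abbreviation(s):
    """Crude assumed test: one token, 2-10 characters, at least one uppercase letter."""
    s = s.strip()
    return 2 <= len(s) <= 10 and " " not in s and any(c.isupper() for c in s)

def candidate_pairs(sentence):
    """Return (abbreviation, candidate expansion text) pairs for both orientations."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        inner = m.group(1).strip()
        before = sentence[:m.start()].strip()
        if looks_like_abbreviation(inner):
            # "expansion (ABBR)": the expansion lies somewhere in the preceding text
            pairs.append((inner, before))
        else:
            words = before.split()
            if words and looks_like_abbreviation(words[-1]):
                # "ABBR (expansion)": the expansion is the parenthesized text
                pairs.append((words[-1], inner))
    return pairs

def first_letter_matches(abbr, expansion):
    """The restrictive assumption: both strings start with the same letter."""
    return bool(abbr) and bool(expansion) and abbr[0].lower() == expansion[0].lower()
```

A system that handles only the first branch of `candidate_pairs` would miss “AMI (acute myocardial infarction)”, and a system that requires `first_letter_matches` would reject the pair “AW” and “water activity”.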
The first-letter assumption means that these algorithms cannot extract the expansion of the abbreviation “AW,” for example, which stands for “water activity.” Some researchers have noted that rarely occurring abbreviation types (or minor abbreviation types) such as those in the above examples can be safely ignored because minor abbreviation types have almost no impact on the performance of an abbreviation extraction system, an abbreviation database, or an abbreviation dictionary. Furthermore, some abbreviations of minor types could be extracted along with the major types. However, the existing algorithms are insufficient for our goal of filtering papers retrieved with a PubMed search. We have been constructing a system, called PETER (PubMed Enhancer Toward Efficient Research), to eliminate papers that are irrelevant to a query gene from PubMed search results.

Affiliations of the authors: Department of Computational Biology, University of Tokyo, Chiba, Japan (HA, TT); Basic Research Laboratory, Kanebo Cosmetics, Inc., Kanagawa, Japan (HA). This work was partly supported by a grant from the Grant-in-Aid for Scientific Research in Priority Areas Genome Information Science, Japanese Ministry of Education, Culture, Sports, and Technology. The authors thank the staff of the Department of Computational Biology, University of Tokyo, and the staff of the Basic Research Laboratory, Kanebo Cosmetics, Inc., for their contribution to this study. The authors are also grateful to Yasunori Yamamoto of the Department of Computer Science, University of Tokyo, for editing the manuscript. Correspondence and reprints: Hiroko Ao, MSc, Department of Computational Biology, University of Tokyo CB01, 5-1-5, Kashiwanoha, Kashiwa-shi, Chiba, 277-8561, Japan; e-mail: . Received for review: 12/02/04; accepted for publication: 04/23/05.

AO, TAKAGI, ALICE: An Algorithm to Extract Abbreviations from MEDLINE
We found that when searching for biomedical literature in the MEDLINE database with a PubMed search, researchers are often hindered by the ambiguity of abbreviations, especially those of gene and protein names. To solve this problem, PETER needs an algorithm that can extract all types of abbreviations with their expansions from a target paper on the fly. In this paper, we describe an algorithm called ALICE (Abbreviation LIfter using Corpus-based Extraction). It searches for parentheses and identifies and extracts pairs of abbreviations and their expansions by using heuristic pattern-matching rules. It follows the same strategy as Yu et al. and Schwartz and Hearst. However, our algorithm uses additional manually expanded patterns, rules, and stop word lists, which are based on thorough investigation and heuristics. ALICE can potentially identify 320 valid abbreviation-expansion patterns as combinations of the rules. They include types that the previous algorithms do not cover; that is, our system overcomes the above-mentioned limitations. As a result, ALICE achieved 95% recall and 97% precision on randomly selected titles and abstracts from the MEDLINE database. This indicates that ALICE does not need to restrict the target literature to a specific biomedical research field to perform well. This system can help users construct not only a useful abbreviation database or dictionary, but also a system for retrieving papers from the MEDLINE database, such as PETER. An abbreviation database or dictionary based on biomedical literature would help biomedical researchers recognize undefined abbreviations in a paper.

Background

Larkey et al. described an ad hoc algorithm called Acrophile to extract acronyms from Web pages. Their approach is based on the use of parentheses, cue words, and ad hoc rules. They tested four different extraction algorithms: Contextual, Canonical/Contextual, Canonical, and Simple Canonical.
These algorithms differ from one another in terms of the types of acronyms, forms of expansions, and text patterns of acronym-expansion pairs they can identify. The four algorithms use different clues (e.g., parenthetical expressions, cue words such as “stands for” or “or”) to identify acronym-expansion pairs. Acrophile is one of a few systems that can identify acronyms introduced without parentheses. In addition, the Contextual algorithm pays special attention to digits. For example, if an acronym contains “3M” or “3D,” these are replaced with “MMM” or “three dimensional,” respectively. Because this system was built for Web pages, it performed poorly on biomedical text in our preliminary experiment. It cannot extract pairs such as “14C-urea breath test (14C-UBT),” “granule membrane protein-140 (GMP-140),” “fibrinogen (Fg),” or “protein kinase C (PKC).”

Chang et al. used a supervised machine-learning algorithm to extract abbreviations and their expansions from MEDLINE abstracts. Their approach is based on the use of parentheses and the resemblance to a training set of human-annotated abbreviations. They assumed that an abbreviation was enclosed in parentheses. After scanning a text to find a candidate abbreviation inside parentheses, the system aligns the candidate with the words before the left parenthesis to match as many letters as possible in the two strings. Then, it converts the candidate abbreviation and its optimal alignments from the aligned words into a feature vector. Next, it applies a binary logistic regression classifier to generate a score from the feature vector. The algorithm had a maximum recall of 83% at 80% precision. The drawback is that an abbreviation must be defined within parentheses.

Wren and Garner developed a set of heuristics called Acronym Resolving General Heuristics (ARGH) to identify “acronym-definition pairs” in the MEDLINE database.
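The alignment step described above for Chang et al. (matching as many letters as possible between the candidate abbreviation and the preceding words) resembles a longest-common-subsequence computation. The following Python sketch assumes that reading; `lcs_length` and `matched_fraction` are hypothetical names, and the single feature shown is an illustrative assumption, not the feature vector or classifier used by those authors.

```python
def lcs_length(abbr, text):
    """Length of the longest common subsequence, ignoring case."""
    a, t = abbr.lower(), text.lower()
    # dp[i][j] holds the LCS length of a[:i] and t[:j]
    dp = [[0] * (len(t) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(t) + 1):
            if a[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(t)]

def matched_fraction(abbr, text):
    """Hypothetical feature: share of the abbreviation's letters and digits
    that can be aligned, in order, with characters of the candidate text."""
    letters = "".join(c for c in abbr if c.isalnum())
    if not letters:
        return 0.0
    return lcs_length(letters, text) / len(letters)
```

A score such as `matched_fraction("PKC", "protein kinase C")` could serve as one input among several to a classifier; a real system would need many more features and a trained model.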
To our knowledge, ARGH is very similar to our approach; however, we could not fully compare their algorithm with ours because they evaluated ARGH with various rule sets (e.g., “term consists of one word only” and/or “require first letter match on abbreviation-type acronyms”), and none of the sets were the same as ours. They reported systematic rates of precision and recall (which refer to database entries) and per-identification-event rates of precision and recall (which refer to query texts). Although they mentioned that the systematic recall of the algorithm was around 93.0% and its systematic precision was around 96.5%, these are not the per-identification-event rates that we used, and the heuristics for valid pairs are very limited, as mentioned above.

Liu and Friedman proposed an algorithm based on the use of parentheses and statistical rules to extract a set of related terms from the biomedical literature. The system can extract not only abbreviations associated with their corresponding expansions, but also other semantically related terms such as synonyms, hyponyms, etc. It is one of the systems that can identify synonymous terms in addition to abbreviations. First, it collects all parenthetical expressions from a large collection of texts. Next, it detects all outer-text strings that share the same inner text. Then, it derives and assesses a set of pair-wise terms with frequency information. Finally, it separates these terms into a set of abbreviations and their expansions and a set of other related terms. The drawback is that it is not suitable for identifying expansions that occur only once in a text. The recall of the algorithm was around 88.5%, and its precision was 96.3%.

Schwartz and Hearst reported a simple algorithm based on the use of parentheses and ad hoc rules for identifying abbreviation definitions in biomedical texts. It extracts short-form, long-form pair candidates from a text and then identifies the correct long form among the candidates.
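The frequency-based grouping described above for Liu and Friedman (collect parenthetical expressions, group outer-text strings that share the same inner text, then use frequency information) can be sketched roughly as follows. The word window `max_words`, the threshold `min_count`, and both function names are assumptions made for illustration; the source does not specify how outer text is delimited or scored.

```python
import re
from collections import Counter, defaultdict

def collect_pairs(texts, max_words=3):
    """Count, for every parenthesized inner text, the word sequences that
    precede it. max_words is an assumed window size."""
    counts = defaultdict(Counter)
    for text in texts:
        for m in re.finditer(r"\(([^()]+)\)", text):
            inner = m.group(1).strip()
            words = text[:m.start()].split()
            for n in range(1, min(max_words, len(words)) + 1):
                outer = " ".join(words[-n:]).lower()
                counts[inner][outer] += 1
    return counts

def frequent_pairs(counts, min_count=2):
    """Keep only (inner, outer) pairs seen at least min_count times (assumed threshold)."""
    return {
        (inner, outer)
        for inner, outers in counts.items()
        for outer, k in outers.items()
        if k >= min_count
    }
```

The sketch also shows the drawback noted above: a pair that occurs only once never reaches the frequency threshold, so it cannot be separated from noise by counting alone.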
Schwartz and Hearst's system has more restrictions on the identifiable abbreviation types than ours. For example, correct short forms must consist of at most two words and their length must be two to ten characters; correct long forms must be adjacent to the short form (i.e., they do not allow for an offset word between the short form and the long form) and include every letter of the short form, etc. They emphasized that their system was highly effective and less specific than other approaches that used carefully crafted rules for biomedical texts, and, above all, it was extremely simple. The algorithm had a recall of 82% and a precision of 96%. We consider their assumptions insufficient to cover the pairs that appear in the biomedical literature.

Journal of the American Medical Informatics Association, Volume 12, Number 5, Sep/Oct 2005
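The restrictions attributed above to Schwartz and Hearst (short forms of at most two words and two to ten characters, and an adjacent long form that includes every letter of the short form) can be illustrated with a simplified Python sketch. The right-to-left letter scan is one way to realize the “every letter” restriction and is used here as an assumption; the function names are hypothetical, and this is an illustration of the idea rather than a reproduction of their published code.

```python
def is_valid_short_form(short):
    """Restrictions as stated in the text: at most two words, 2-10 characters.
    Requiring at least one letter is an added assumption."""
    return (
        2 <= len(short) <= 10
        and len(short.split()) <= 2
        and any(c.isalpha() for c in short)
    )

def best_long_form(short, preceding_text):
    """Scan right to left, matching every letter or digit of the short form
    inside the text that immediately precedes the parenthesis.
    Returns the matched long form, or None if some character cannot be matched."""
    s_idx = len(short) - 1
    l_idx = len(preceding_text) - 1
    while s_idx >= 0:
        ch = short[s_idx].lower()
        if not ch.isalnum():
            s_idx -= 1
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally match at the start of a word.
        while l_idx >= 0 and (
            preceding_text[l_idx].lower() != ch
            or (s_idx == 0 and l_idx > 0 and preceding_text[l_idx - 1].isalnum())
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
        s_idx -= 1
    # Extend the match leftwards to the beginning of the word it falls in.
    start = preceding_text.rfind(" ", 0, l_idx + 1) + 1
    return preceding_text[start:]
```

Under these assumptions, `best_long_form("AMI", "patients with acute myocardial infarction")` yields “acute myocardial infarction”, whereas `best_long_form("AW", "water activity")` yields `None`, because the letters of the short form do not appear in order in the long form.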


Similar resources

Research Paper: ALICE: An Algorithm to Extract Abbreviations from MEDLINE

OBJECTIVE To help biomedical researchers recognize dynamically introduced abbreviations in biomedical literature, such as gene and protein names, we have constructed a support system called ALICE (Abbreviation LIfter using Corpus-based Extraction). ALICE aims to extract all types of abbreviations with their expansions from a target paper on the fly. METHODS ALICE extracts an abbreviation and ...


Research Paper: Creating an Online Dictionary of Abbreviations from MEDLINE

OBJECTIVE The growth of the biomedical literature presents special challenges for both human readers and automatic algorithms. One such challenge derives from the common and uncontrolled use of abbreviations in the literature. Each additional abbreviation increases the effective size of the vocabulary for a field. Therefore, to create an automatically generated and maintained lexicon of abbrevi...


Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles

Biomedical abbreviations and acronyms are widely used in biomedical literature. Since many of them represent important content in biomedical literature, information retrieval and extraction benefit from identifying the meanings of those terms. On the other hand, many abbreviations and acronyms are ambiguous, so it is important to map them to their full forms, which ultimately represent the ...


A Method to Retrieve Papers from MEDLINE: PETER System

We attempted to eliminate non-relevant papers from results of PubMed searches for each topic. The system is called PETER (PubMed Enhancer Toward Efficient Research) and it works as follows. 1. get LocusLink IDs manually. 2. collect information of gene names (AKA synonyms) from public databases. 3. make synonym variations automatically. 4. search papers by PubMed with each synonym. 5. extract ti...


Creating an Online Dictionary of Abbreviations from MEDLINE

Design. Our method uses a statistical learning algorithm, logistic regression, to score abbreviation expansions based on their resemblance to a training set of human-annotated abbreviations. We applied it to Medstract, a corpus of MEDLINE abstracts in which abbreviations and their expansions have been manually annotated. We then ran the algorithm on all abstracts in MEDLINE, creating a dictiona...




Publication date: 2005